Lexical Scanner
A lexical scanner is used to sort out a character stream into tokens. The tokens represent different types of groups and/or specific representation. All compilers use some sort of lexical scanner for reading the file/character-stream. An example problem that needs to be solved with a lexical scanner: A 2D computer-game with a dynamic environment that has a lot of events in it. We need to avoid hardcoding the events into the source and separate the actions in an event from the actual engine. A normal solution is to have files on the local filesystem which is loaded by the software and then executed at a later stage or at loadtime (during parsing). The first thing that allows us to parse the file is the lexical scanner (can be done without it but it will become more static). The scanner is designed to take either a filepointer or a filestream into account and then wait for commands to start reading from it. The key components of a lexical scanner is: :Scanning ::Scans the current stream and groups together a set of characters (if more then one) :Error-handling ::The scanner can throw an exception or exit the software where a certian token was expected :Matches ::Match a certain token and/or specific value which can throw an error if needed. Example A simple lexical-scanner draft. This scanner uses only 4 tokens for strings, numerics, characters and one for end-of-file (which is used for visual aid). This design is not the best to use since one should use reuse code where possible. One could create an interface and a superclass that implements most common scanner-functionallity. This allows additions for reading direct from compressed streams where the data is uncompressed on-the-fly and the compressed file/stream is returned as tokens. A scanner can handle all the tokens which is the default. Such tokens as operators, keywords and so on. This scanner on the other hand only groups numerics, strings and characters. Note that this interface is written with a pseudo-C++ code showing how inheritance and interfaces can play a part of the scanner structure. You can write the scanner in any language you want where you could use the builtin functions in the language instead of your own isAlpha(), isNumerc(), etc, functions. Languages with builtin/extensions for character recognition: *C/C++: (ISO C99) *PHP: ctype extension (see php manual) /* alpha characters: T_STRING */ #define CHARS_ALPHA "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" /* numeric characters: T_NUMERIC */ #define CHARS_DIGIT "0123456789" /* tokens */ #define T_EOF 0 #define T_STRING 1 #define T_NUMERIC 2 #define T_CHARACTER 3 /* lexical scanner interface */ class Scanner_Interface /* end-of-stream ? */ bool isEof() /* check for alpha character */ bool isAlpha() /* check for numeric character */ bool isNumeric() /* check for alpha-numeric character */ bool isAlphaNumeric() /* match current value with a string, throws exception */ void match(char *) /* match current token, throws exception */ void matchToken(token_t) /* get an alpha-value, throw exception if the current character * was'nt an alpha character. */ void getAlpha() /* get a numeric-value, throw exception if the current character * was'nt a numeric character. */ void getNumeric() /* get the current token */ token_t getToken() /* write the value to the pointer given. * could also return a const char * to local value, */ void getAlphaValue(char *) /* return the value as a signed integer */ int getNumericValue() /* read next character in stream */ void readNext() /* perform a scan */ void scan() /* a lexical scanner that uses a characterstream. * fast since it has a buffer in memory to read from at the expense of memory. */ class Scanner_CharStream extends Scanner_Interface /* make a local copy of the data passed to us */ char * mStream /* current character being processed */ char mCurrent /* last value found in a scan or getAlpha/Numeric() */ char mValue [ 512 ] /* token from the last scan or getAlpha/Numeric() */ token_t mToken /* constructor */ Scanner( const char * data ); /* interface implementation */ /* a lexical scanner that uses a filepointer. * saves memory at the expense of speed. */ class Scanner_FileStream extends Scanner_Interface FILE * mFilePointer char mCurrent char mValue [ 512 ] token_t mToken Scanner( FILE * fp ) /* interface implementation */ Now for some internal workings with the use of pseudo-code: function Scanner :: match (matchString) if not mString equal matchString then throw ScannerException ("expected: '" + matchString + "'") '' getAlpha and getNumeric is similar '' getNumeric calls isNumeric() only instead function Scanner :: getAlpha () '' check first character that it is an alphacharacter if not isAlpha () then throw ScannerException ("expected T_STRING") '' in VB and/or PHP clear the valuestring first '' read alphanumerics until not found anymore while not isAlphaNumeric () do '' build the string mValue = mValue + mCurrent readNext () '' in C/C++ end the string with NULL-character mToken = T_STRING function Scanner :: scan () '' get alpha if isAlpha () then getAlpha () '' get numeric else if isNumeric() then getNumeric() '' otherwise just set it as a character '' readNext checks if eof has occured else '' check for eof incase the isEof() is'nt used if isEof () then throw ScannerException ("unexpected end-of-file") mValue = mCharacter '' C/C++: add NULL-char readNext () function Scanner :: readNext () '' check for eof, if so then reset variables so they cant be used if isEof () then mCharacter = "" mToken = T_EOF mValue = NULL '' read the character at the current position in stream mCharacter = mValue [ current position in stream ] '' increment position current position in stream += 1 This covers most of the basics for setting up a lexical scanner. You dont have to write your own lexical scanner though. There are some splendid scanner-generators out there which can be used for free. See the links list. External links *GNU BISON *Fast Lexical Analyzer *Lex & Yacc (Yet Another Compiler Compiler)